These notes are based on Chapter 2.7 & 7.4-7.6 of Kroese, D. P., & Chan, J. C. (2014). Statistical modeling and computation. New York: Springer.
To use Monte Carlo methods, first we specify random variables and then sample from them. In the examples of Monte Carlo integration, we use only uniform and exponential distributions, so sampling is not a problem; we can just call R built-in samplers (runif or rexp).
In general, this is not always the case. For example, in Bayesian Statistics, after computing the posterior PDF using Bayes’ formula, we are most likely interested in its mean. However, as you can imagine by noticing the products and division involved in the formula, posterior PDFs can easily become unwieldy. Therefore, we would like to use MC integration, but how can we sample from those unwieldy PDFs that do not resemble any common PDF?
In these notes, we cover three sampling methods in ascending order of sophistication: the inverse transform method, the acceptance-rejection method, and Markov chain Monte Carlo (MCMC).
The method is applicable only to cases in which we can invert (part of) the CDF of \(X\). When applicable, it is fast and reliable.
To illustrate the idea, consider the case where the CDF \(F\) of \(X\) is invertible (e.g., \(X\sim N(0,1)\)). Given \(U\sim U(0,1)\), the random variable \(Y=F^{-1}(U)\) has the same CDF as \(X\). That is, given a sample \(u\) of \(U\), \(y=F^{-1}(u)\) is a sample of \(X\). Here is why:
\[\begin{align} P(Y\leq y) &= P(F^{-1}(U)\leq y)\\ &= P(U\leq F(y))\\ &= F(y) \end{align}\]
par(mar=c(2,.5,1,.5))
f <- function(x) 1/(1+exp(-(x-2)))
curve(1/(1+exp(-(x-2))), from=-2, to=7, axes=FALSE, ann=FALSE)
axis(side=1, at=c(-4,0,2.5,8), labels=c("","0","X",""))
abline(v=0)
lines(c(0,7), c(1,1), lty=3)
arrows(0,f(2.5),2.5,f(2.5),length=.1,angle=12)
arrows(2.5,f(2.5),2.5,-.01,length=.1,angle=12)
text(0,f(2.5), "U", pos=2)
text(0,1, "1", pos=2)
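As a quick sketch, the S-shaped curve in the figure above is the CDF of a logistic distribution with location 2, which is strictly increasing over \(\mathbb{R}\) and invertible in closed form: solving \(u = 1/(1+e^{-(x-2)})\) gives \(F^{-1}(u) = 2 + \log\frac{u}{1-u}\).

```r
# Inverse transform for the logistic CDF F(x) = 1/(1 + exp(-(x - 2)))
set.seed(1)
Finv <- function(u) 2 + log(u / (1 - u))  # closed-form inverse of F
u <- runif(1e5)
x <- Finv(u)
mean(x)                          # should be close to the location parameter 2
mean(rlogis(1e5, location = 2))  # compare with R's built-in sampler
```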
Note that the definition \(Y=F^{-1}(U)\) implies \(U=F(Y)\). Since \(X\) and \(Y\) have the same distribution and \(X\) is an arbitrary continuous random variable, we have \(U=F(X)\): if we transform an arbitrary continuous \(X\) using its CDF, we obtain a \(U(0,1)\) random variable.
The CDFs of many, if not most, random variables are technically not invertible. Remember that a random variable \(X\) is a function whose codomain is the entire real line \(\mathbb{R}\). Thus, to be invertible, \(F\) must be strictly increasing over \(\mathbb{R}\); equivalently, the support of \(X\) must be \(\mathbb{R}\). This condition excludes many random variables, including the Exponential and Gamma, whose support is only the half-line \([0,\infty)\).
However, the CDF for \(X\) with the support \([0,\infty)\) is almost invertible; if we define \(G\) to be equal to \(F\) with the domain \([0,\infty)\) instead of \((-\infty,\infty)\), then \(G\) is invertible and we can generate a sample \(G^{-1}(u)\) of \(X\).
When \(X\) is discrete, \(F\) is a step function as below.
To deal with a more general case, we may use the following \(H^{-1}:(0,1)\to\mathbb{R}\) to generate a sample \(H^{-1}(u)\) of \(X\): \[ H^{-1}: u \mapsto \inf\{x : F(x)\geq u\} .\]
Notice that \(H^{-1}\) is a generalisation of \(F^{-1}\) or \(G^{-1}\); \(H^{-1}\) still works in those special cases.
In practice, when \(X\) has finite support, we can use the R built-in function sample(x, prob=NULL, ...), to which we pass the PMF as prob.
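For instance, here is a minimal sketch with a made-up PMF on \(\{0,1,2,3\}\):

```r
# Generalised inverse in practice: sample() draws from any finite-support PMF.
# The support and probabilities below are made up for illustration.
set.seed(1)
support <- c(0, 1, 2, 3)
pmf <- c(0.1, 0.2, 0.3, 0.4)   # must sum to 1
x <- sample(support, size = 1e5, replace = TRUE, prob = pmf)
prop.table(table(x))           # empirical frequencies approximate the PMF
```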
Let’s generate samples from \(\text{exp}(1)\) whose CDF is \(F(x)=1-e^{-x}\) for \(x\geq0\) and \(0\) otherwise. Solving the following \[\begin{alignat}{2} && u &= 1-e^{-x}\\ &\Leftrightarrow\quad & x &= -\log(1-u)\\ \end{alignat}\] we find \(G^{-1}(u) = -\log(1-u)\).
Below is a histogram together with the actual PDF.
n <- 1e6
u <- runif(n)
x_exp <- -log(u) # U and 1-U have the same distribution
x <- x_exp[x_exp<6] # Truncate for visualisation
hist(x, seq(0,6,.1), freq=FALSE, ylim=c(0,1), ylab="f(x)", main="PDF of exp(1)")
curve(exp(-x), from=0, lwd=2, add=TRUE)
Even when the inverse of \(F\) exists, inverting \(F\) is often analytically impossible or numerically too expensive. In these cases, we may try the acceptance-rejection method.
The acceptance-rejection method is based on the following intuition.
Obtain a random sample \((x,y)\) using uniform sampling from the region formed by the \(x\)-axis and the density curve of \(X\). Then, keep \(x\) as a sample from \(X\).
The beauty of this method is that we need not directly sample from the target PDF, but only uniformly sample from the region under the graph of the target PDF.
Recall that, by uniform sampling from a set \(A\subset\mathbb{R}^d\), we mean a random variable (\(d=1\)) or vector (\(d>1\)) \(X\) with the support \(A\) and the (joint) PDF \(f(x) = \mathbb{1}_A(x)/|A|\).
How can we obtain random samples uniformly taken from that region? A trick is to find another PDF \(g\) that satisfies the following conditions: we can easily sample from \(g\), and there exists a constant \(C\) such that \(f(x)\leq Cg(x)\) for all \(x\).
Given such \(g\), first sample \(x\) from \(g\), and then uniformly sample from \([0, Cg(x)]\). If we keep only samples that fall under the curve \(f(x)\), those samples \(\{(x,y):0<y\leq f(x)\}\) are uniformly taken from the desired region.
The claim is intuitive but may not be obvious. So, two key facts are stated as lemmas below.
Let’s generate samples from \(X\sim N(0,1)\). There are two observations to make.
First, since the PDF is symmetric around 0, once we generate positive samples, we can randomly flip their signs to generate negative samples. The PDF for the positive half can be derived by re-normalising the original PDF: \[ f(x) = \begin{cases} \sqrt{\frac{2}{\pi}}e^{-x^2/2} & x\geq0\\ 0 & \text{otherwise} \end{cases} . \]
Second, we can bound the above \(f\) by \(Cg(x)\) where \(g\) is the PDF for \(\text{exp}(1)\) and \(C=\sqrt{2e/\pi}\).
Notice \(C = \max_x \frac{f(x)}{g(x)}\).
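We can sanity-check this value numerically; the sketch below maximises the ratio \(f(x)/g(x)\) with optimize:

```r
# Numerical check that C = sqrt(2e/pi) is the tightest constant,
# i.e., the maximum of f(x)/g(x) over x >= 0.
ratio <- function(x) sqrt(2/pi) * exp(-x^2/2) / exp(-x)
opt <- optimize(ratio, interval = c(0, 10), maximum = TRUE)
opt$maximum                    # maximiser is x = 1
opt$objective                  # equals sqrt(2*exp(1)/pi), about 1.3155
```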
curve(sqrt(2*exp(1)/pi)*exp(-x), from=0, to=4, ylab="f(x)", lty=3)
curve(sqrt(2/pi)*exp(-x^2/2), from=0, to=4, add=TRUE)
title(bquote(f(x) ~ "bounded by" ~ Ce^-x))
legend(2.9, 1.2, lty=c(1,3),
legend=c("Half normal", expression(Ce^-x)))
Here is a result. (Note that we reuse the samples from \(\text{exp}(1)\) generated by the inverse transform method above.)
half.normal <- function(x) sqrt(2/pi)*exp(-x^2/2)
C <- sqrt(2*exp(1)/pi)
y <- runif(length(x_exp))*C*exp(-x_exp) # Reuse x_exp from exp(1)
x_pos <- x_exp[y <= half.normal(x_exp)] # Keep those under the target PDF
ind <- 1:floor(length(x_pos)/2) # Half sample for negative values
x_neg <- -x_pos[ind]
x_pos <- x_pos[-ind]
x <- c(x_neg[x_neg>-4], x_pos[x_pos<4]) # truncate for visualisation
res <- hist(x, 100, freq=FALSE, ylab="f(x)", xlim=c(-4,4), main="PDF of N(0,1)")
curve(dnorm(x), lwd=2, add=TRUE) # overlay the exact N(0,1) density
A uniform sample \((x,y)\) from the region under the curve \(Cg(x)\) can be obtained by first sampling \(x\) from the PDF \(g(x)\) and then sampling \(y\) from \(\text{U}(0,Cg(x))\) given \(x\).
Let \(h\) be the conditional PDF for \(\text{U}(0,Cg(x))\) given \(x\): \[ h(y|x) = \begin{cases} \frac{1}{Cg(x)} & y\in[0,Cg(x)]\\ 0 & \text{otherwise} \end{cases} . \] Then, observe the following joint PDF of \((x,y)\): \[\begin{align} f(x,y) &= h(y|x)g(x)\\ &= \frac{1}{Cg(x)}g(x)\\ &= \frac{1}{C}, \end{align}\] which implies the uniform (joint) distribution.
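A consequence worth checking numerically: since the points \((x,y)\) are uniform over the region under \(Cg\), whose area is \(C\), the fraction that also falls under \(f\) (the acceptance rate) must be \(\text{area}(f)/C = 1/C\). A quick sketch:

```r
# Because (x, y) is uniform over the region under C*g (area C), the fraction
# of points that also fall under f must be area(f)/C = 1/C.
set.seed(1)
C <- sqrt(2*exp(1)/pi)
n <- 1e5
x <- rexp(n)                          # x from g = exp(1)
y <- runif(n) * C * exp(-x)           # y | x ~ U(0, C*g(x))
accept <- y <= sqrt(2/pi) * exp(-x^2/2)
mean(accept)                          # close to 1/C, about 0.76
```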
Let \(A,B\subset\mathbb{R}^2\) such that \(A\subseteq B\), and suppose \(\{x_n\} = x_1,x_2,\dots\) is an IID sequence obtained by uniform sampling from \(B\). Then the subsequence \(\{x_{n_k}\} = x_{n_1},x_{n_2},\dots\) such that \(x_{n_k}\in A\) for all \(k\geq1\) is an IID sequence obtained by uniform sampling from \(A\).
When you watch the animation in Estimating \(\pi\), don’t you think dots look uniformly distributed inside the quarter disk?
First, consider the probability that \(x\) falls inside a small region \(dA\) inside \(A\). \[\begin{align} P(x\in dA |x\in A) &= \frac{P(x\in(dA\cap A))}{P(x\in A)}\\ &= \frac{P(x\in dA)}{P(x\in A)}\\ &= \frac{\int_{dA}fdx}{\int_{A}fdx} \end{align}\]
Then, writing \(|dA|\) for the area of \(dA\) and shrinking \(dA\) around \(x\), the fundamental theorem of calculus gives the conditional density: \[\begin{align} f(x|x\in A) &= \lim_{|dA|\to 0}\frac{P(x\in dA \mid x\in A)}{|dA|}\\ &= \lim_{|dA|\to 0}\frac{1}{|dA|}\cdot\frac{\int_{dA}f\,dx}{\int_{A}f\,dx}\\ &= \frac{f(x)}{\int_{A}f\,dx} \end{align}\]
An implication is that, if \(x\) is uniformly distributed over \(B\), \[\begin{align} f(x|x\in A) &= \frac{\frac{1}{|B|}}{\frac{|A|}{|B|}}\\ &= \frac{1}{|A|}\\ \end{align}\]
which is a constant and integrates to \(1\) over \(A\).
Hence, a subset of uniform samples from \(B\) that also belong to \(A\) are indeed uniformly distributed over \(A\).
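We can see the lemma at work in the quarter-disk setting mentioned above. The sketch below keeps the uniform samples from the unit square that land inside the quarter disk and checks two consequences of uniformity over the disk (the mean coordinate \(4/(3\pi)\) follows from a standard integral over the quarter disk):

```r
# Uniform points in the unit square, filtered to the quarter disk.
set.seed(1)
n <- 1e5
x <- runif(n); y <- runif(n)
inside <- x^2 + y^2 <= 1
4 * mean(inside)                 # area ratio recovers pi
mean(x[inside])                  # mean coordinate over the quarter disk: 4/(3*pi)
```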
In principle, the acceptance-rejection method can work in high dimensions if we can find a PDF \(g(x)\) from which we can efficiently sample and \(C\) for a tight envelope \(Cg(x)\). It turns out this is a big IF.
In the 1-D case, looking at the graph of the target \(f(x)\), we are often able to design a reasonable envelope \(Cg(x)\). Even if the envelope is not very tight, we can wait for some time and still obtain enough accepted samples, i.e., points \((x,y)\) with \(y\leq f(x)\).
This is unlikely to be the case in high dimensions. Our visual intuition in a 4-D or higher space is poor (at least mine is). In addition, the number of samples needed grows exponentially with dimension, and we may not be able to wait for hours or days.
Markov chain Monte Carlo (MCMC) is a general class of methods to sequentially generate samples from an approximate distribution that increasingly resembles a target distribution. A sequence of samples are generated by running a Markov chain, which is designed to have the limiting distribution equal to the target distribution. By being content with only approximate distributions, MCMC enables us to handle functions on a high-dimensional space, which are common in modern complex problems.
Since Monte Carlo simply refers to stochastic simulation, we might just call it Markov chain simulation. But, MCMC has sort of become the standard name everybody using statistical computing understands, perhaps because the acronym is just catchy.
To present practical tools in a small space, we gloss over much of the Markov chain theory, which is quite rich and fascinating!
A Markov chain is a sequence of random variables \(X_1,X_2\dots\) whose futures are conditionally independent of the past given the present. That is, for all \(i\geq0\), \[(X_{i+1}|X_i,X_{i-1},\dots) \sim (X_{i+1}|X_i) .\]
In an IID sequence (e.g., Bernoulli trials), futures are independent even of the present.
For MCMC applications, we are interested in continuous random variables \(\{X_i\}\). In this case, for \(A\subset\mathbb{R}^d\), \[P(X_{i+1}\in A|X_i=x_i,X_{i-1}=x_{i-1},\dots) = P(X_{i+1}\in A|X_i=x_i) .\]
In particular, we restrict ourselves to Markov chains for which the conditional PDFs \(f_{X_{i+1}|X_i}(y|x)\) do not depend on \(i\) and denote it by \(q\). That is, \[q(y|x) = f_{X_{i+1}|X_i}(y|x)=f_{X_1|X_0}(y|x)\] for all \(i\geq0\). We call \(q\) the transition density of the Markov chain.
A probability distribution is called stationary if its PDF \(f\) satisfies the following global balance equations: \[f(x_{i+1}) = \int f(x_{i})q(x_{i+1}|x_{i})dx_{i} .\]
You may see \(f(x_{i})q(x_{i+1}|x_{i})\) as the joint density of \(X_i\) and \(X_{i+1}\). If we integrate out \(X_i\), we obtain the marginal distribution of \(X_{i+1}\) as usual. Under this special dependency, however, the marginal for \(X_{i+1}\) turns out to be the same as the one for \(X_i\).
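The discrete analogue may make the balance equations concrete: for a finite-state chain with transition matrix \(P\), the stationary PMF \(f\) satisfies \(f = fP\). A minimal sketch with a hypothetical two-state chain:

```r
# Discrete analogue of the global balance equations: f = f %*% P.
# The two-state transition matrix is hypothetical.
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
f <- c(2/3, 1/3)                 # candidate stationary PMF
f %*% P                          # equals f again, so f is stationary
```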
Given the initial value \(x_0\), if we can draw samples from \(q\), then we can simulate a Markov chain and sequentially obtain samples \(x_0,x_1,\dots,x_n\).
The rationale for the Markov chain as a method for sampling from our desired distribution \(f\) is that, if the chain is designed so that its limiting distribution is \(f\), then after sufficiently many steps (say \(I\)) the subsequent samples \(x_{I+1},x_{I+2},\dots\) are approximately distributed according to \(f\).
Samples from the first \(I\) steps (so-called “burn-in” period) are often discarded.
Hence, our task in MCMC is to design a good Markov chain.
Ergodicity is the condition for a Markov chain to converge to a unique stationary distribution irrespective of \(x_0\). It consists of two intuitive properties: irreducibility and aperiodicity.
Irreducibility ensures that every value \(x\) can be reached from every other value within finitely many steps. This matters because, in MCMC, the sequence of samples is generated by dependent sampling according to the transition density \(q\), and some \(x_{i+1}\) may not be reachable directly from the current \(x_i\), i.e., \(q(x_{i+1}|x_i)=0\). Irreducibility guarantees that such \(x\) will eventually be reached.
Aperiodicity prevents indefinite oscillation and allows the chain to converge to a stationary distribution.
See Roberts & Rosenthal (2004) for more details.
Among the ergodic Markov chains, we could in principle choose a transition density \(q\) that satisfies the global balance equations to ensure that the chain converges to our target \(f\). In practice, however, this is tricky and challenging. So, we typically impose a stronger structure called the detailed balance equations: \[f(x)q(y|x) = f(y)q(x|y), \;\forall x,y.\]
By integrating both sides over \(x\), we see the detailed balance equations imply the global balance equations.
Here is an example of MCMC for sampling from a bivariate normal distribution \(N(\mu,\Sigma)\) with \[ \mu=0 \quad\text{and}\quad \Sigma=\begin{pmatrix}1 & 0.8\\0.8 & 1\\\end{pmatrix} .\]
In practice, of course, nobody should use MCMC for sampling from a normal distribution. Instead, use the function mvrnorm from the MASS package.
Remember that the joint PDF looks like this:
library(akima)  # provides interp(); the 'interp' package offers a compatible interp()
library(plotly) # provides plot_ly(), add_surface() and the %>% pipe
r <- 0.8
f <- function(x,y) {
  return( 1/(2*pi*sqrt(1-r^2))*exp(-1/(2*(1-r^2))*(x^2+y^2-2*r*x*y)) )
}
X <- seq(-3, 3, length=100)
Y <- seq(-3, 3, length=100)
XY <- expand.grid(X=X, Y=Y) # create a grid
Z <- f(XY$X, XY$Y)
s <- interp(x=XY$X, y=XY$Y, z=Z)
plot_ly(x=s$x, y=s$y, z=t(s$z)) %>% add_surface()
Using the Gibbs sampler, a result may look like the following.
n <- 300
m <- 5
x <- rep(0,n)
y <- rep(0,n)
x[1] <- 5*runif(1)-2.5
y[1] <- 5*runif(1)-2.5
for (i in 2:n) {
x[i] <- rnorm(1, mean=r*y[i-1], sd=sqrt(1-r^2))
y[i] <- rnorm(1, mean=r*x[i], sd=sqrt(1-r^2))
par(mar=c(5,5,.1,.1))
plot(x[1], y[1], xlab="x", ylab="y", xlim=c(-3,3), ylim=c(-3,3), pch=4)
if (i <= m) {
segments(x[-i],y[-i],x[-1],y[-1],lwd=.5)
} else {
points(x[2:(i-(m-1))], y[2:(i-(m-1))], pch=20, cex=.3)
segments(x[(i-m):(i-2)], y[(i-m):(i-2)],
x[(i-(m-1)):(i-1)], y[(i-(m-1)):(i-1)], lwd=.5)
}
arrows(x[i-1],y[i-1],x[i],y[i], length=0.08, angle=20, lwd=1)
}
Note the dependency of each new point on the previous point; each arrow goes only so far. Collectively, however, the points are distributed as if sampled from the target distribution.
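For a longer, non-animated run, here is a minimal sketch using the same conditional updates as above, checking the sample correlation against the target \(r=0.8\):

```r
# A longer Gibbs run without the animation; same conditional updates.
set.seed(1)
r <- 0.8
n <- 1e4
x <- y <- numeric(n)
for (i in 2:n) {
  x[i] <- rnorm(1, mean = r * y[i-1], sd = sqrt(1 - r^2))
  y[i] <- rnorm(1, mean = r * x[i],   sd = sqrt(1 - r^2))
}
burn <- 100                      # discard the burn-in period
cor(x[-(1:burn)], y[-(1:burn)])  # close to the target correlation 0.8
```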
In practice, most of us do not design a transition density \(q\) from scratch; rather, we use proven design templates. The Metropolis-Hastings algorithm provides a good template.
Suppose we have an ergodic Markov chain characterised by a transition density \(q\). The Metropolis-Hastings algorithm tells us how to build upon this Markov chain and design another one that satisfies the detailed balance equations with respect to the target \(f\).
The ergodicity of the new chain is implied by the ergodicity of the original Markov chain with \(q\).
The idea is similar to the acceptance-rejection method: given the current state \(x_i\), we draw a proposal \(y\) from \(q(\cdot|x_i)\) and accept it as \(x_{i+1}\) with some acceptance probability; otherwise we set \(x_{i+1}=x_i\).
The acceptance probability is defined as follows: \[\alpha(x,y) = \min\left(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\right) .\]
An intuition is the following. If we have a very good proposal density \(q\), which is very similar to the target \(f\), the ratio \(\frac{f(y)q(x|y)}{f(x)q(y|x)}\) will be close to 1, leading to \(\alpha\) close to 1. Hence, most of the time, we accept proposals.
The new transition density for \(x_{i+1}\neq x_i\) is \[\tilde{q}(x_{i+1}|x_i) = q(x_{i+1}|x_i)\alpha(x_i,x_{i+1}) .\] Less importantly, the probability of not moving is \[P(X_{i+1}=x_i|X_i=x_i) = 1 - \int_{y\neq x_i} q(y|x_i)\alpha(x_i,y)\,dy .\] Remember the detailed balance equations for the new chain: \[f(x)\tilde{q}(y|x) = f(y)\tilde{q}(x|y), \;\forall x,y.\] First, notice that, regardless of \(\tilde{q}\), they trivially hold for \(y=x\). This is why the case \(X_{i+1}=x_i\) is unimportant.
Next, for \(y\neq x\), \[\begin{align} f(x)\tilde{q}(y|x) &= f(x)q(y|x)\alpha(x,y)\\ &= f(x)q(y|x)\min\left(1, \frac{f(y)q(x|y)}{f(x)q(y|x)}\right)\\ &= \min\left(f(x)q(y|x), f(y)q(x|y)\right), \end{align}\] which is symmetric between \(x\) and \(y\) (i.e., \(\min(\cdot,\cdot)\) returns the same value even if we switch \(x\) and \(y\)). Hence, \(f(y)\tilde{q}(x|y)\) has the same value as \(f(x)\tilde{q}(y|x)\). Now, the detailed balance equations follow.
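As a concrete instance, here is a minimal sketch of random-walk Metropolis-Hastings targeting \(f=N(0,1)\). The proposal \(q(y|x)=N(x,1)\) is symmetric, so \(q\) cancels in \(\alpha\); the step size 1 is an arbitrary choice:

```r
# Random-walk Metropolis-Hastings targeting f = N(0,1).
set.seed(1)
n <- 1e4
x <- numeric(n)
for (i in 2:n) {
  y <- rnorm(1, mean = x[i-1], sd = 1)        # propose from q(.|x), symmetric
  alpha <- min(1, dnorm(y) / dnorm(x[i-1]))   # q cancels in alpha
  x[i] <- if (runif(1) < alpha) y else x[i-1] # accept or stay put
}
c(mean(x), sd(x))                             # close to 0 and 1
```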
The Gibbs sampler is a special case of the Metropolis-Hastings algorithm and widely used to address multidimensional problems. It is particularly suitable for Bayesian hierarchical models due to the modularised model construction.
Sec 11.3 of Gelman et al. (2013) contains an explanation of how it can be viewed as a special case of the Metropolis-Hastings algorithm.
Suppose \(x\) is a vector in a multidimensional space and divided into \(d\) subvectors \(x=(x^{(1)},\dots,x^{(d)})\). (N.B. Each subvector need not be in one dimension.)
Let \(f^{(k)}\) be the conditional PDF of \(x^{(k)}\) \((k\in\{1,\dots,d\})\) given the values of the other subvectors: \[f^{(k)}(x^{(k)}|x^{(1)},\dots,x^{(k-1)},x^{(k+1)},\dots,x^{(d)}) .\] The Gibbs sampler assumes we can sample from each of these conditional PDFs, and proceeds as follows: given the current \(x_i\), first sample \(x_{i+1}^{(1)}\) from \(f^{(1)}(\,\cdot\,|x_i^{(2)},\dots,x_i^{(d)})\), then sample \(x_{i+1}^{(2)}\) from \(f^{(2)}(\,\cdot\,|x_{i+1}^{(1)},x_i^{(3)},\dots,x_i^{(d)})\), and so on through \(x_{i+1}^{(d)}\), always conditioning on the most recently sampled values.
I think you get the idea. \(x_{i+1} = (x_{i+1}^{(1)},\dots,x_{i+1}^{(d)})\) is technically a proposal in the context of the Metropolis-Hastings algorithm, but \(x_{i+1}\) is always accepted in the Gibbs sampler.
Let’s do simple Bayesian statistics. Assume the following:
The joint distribution of \(\lambda^{(1)},\dots,\lambda^{(d)}\) and \(\theta\) is complex and could be high-dimensional for large \(d\). However, each conditional density belongs to a common family, namely Gamma distribution. Hence, sampling is readily done using the Gibbs sampler.